Analysing the content of Web 2.0 documents by using a hybrid approach
 
  
    
      Zakaria, Lailatul Qadri binti
      
        62776aaa-a8fb-402f-9c94-2baf967a8a3c
      
     
  
  
   
  
  
    
      June 2011
    
    
  
  
    
      
     
  
    
      Hall, Wendy
      
        11f7f8db-854c-4481-b1ae-721a51d8790c
      
     
  
    
      Lewis, Paul
      
        7aa6c6d9-bc69-4e19-b2ac-a6e20558c020
      
     
  
       
    
 
  
    
      
  
 
  
  
  
  Zakaria, Lailatul Qadri binti (2011) Analysing the content of Web 2.0 documents by using a hybrid approach. University of Southampton, Electronics and Computer Science: Web & Internet Science, Doctoral Thesis, 179pp.
  
   
  
    
      Record type: Thesis (Doctoral)
    
   
    
    
      
        
          Abstract
          User involvement in Web 2.0 has made a significant contribution to the increase in the amount of multimedia content on the Web. Images are one of the most used media, shared across the network to mark user experience in daily life. Interactive applications have allowed users to participate in describing these images, usually in the form of free text, thus gradually enriching the images' descriptions. Nevertheless, these images are often left with a crude description or none at all. Web search engines such as Google and Yahoo provide text-based searching to find images by mapping query concepts to the text description of the image, thus limiting information discovery to material with good text descriptions. A similar issue is faced by the text-based search provided by Web 2.0 applications. Images with sparse descriptions may not carry adequate information, while images with no description cannot be found by a text-based search at all. Therefore, there is an urgent need to investigate ways to produce high-quality information that provides insight into the document content. The aim of this research is to investigate a means to improve the capability of information retrieval by utilising Web 2.0 content, the Semantic Web and other emerging technologies. A hybrid approach is proposed which analyses two main aspects of Web 2.0 content, namely text and images. The text analysis uses Natural Language Processing and ontologies. Its aim is to translate free text descriptions into a semantic information model tailored to Semantic Web standards. The image analysis is developed using machine learning tools and is assessed using ROC analysis. Its aim is to develop an image classifier exemplar to identify information in images based on their visual features. The hybrid approach is evaluated using the standard information retrieval performance metrics of precision and recall.
The example semantic information model has structured and enriched the textual content, thus providing better retrieval results than conventional tag-based search. The image classifier is shown to be useful for providing additional information about image content. Each of the approaches has its own strengths and they complement each other in different scenarios. The thesis demonstrates that the hybrid approach has improved information retrieval performance compared to either of the contributing techniques used separately.
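
The abstract names precision and recall as the evaluation metrics. As a reminder of how these are usually computed for a single query, here is a minimal sketch in Python. It is not taken from the thesis: the function name `precision_recall`, the image identifiers and the two result lists (`run_a`, `run_b`) are hypothetical and were invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth and result lists (not data from the thesis).
relevant = {"img1", "img2", "img3", "img4"}
run_a = ["img1", "img2", "img9"]
run_b = ["img1", "img2", "img3", "img9"]

a_p, a_r = precision_recall(run_a, relevant)
b_p, b_r = precision_recall(run_b, relevant)
print(f"run_a: precision={a_p:.2f} recall={a_r:.2f}")
print(f"run_b: precision={b_p:.2f} recall={b_r:.2f}")
```

Comparing two retrieval methods on the same queries with these two numbers is the general pattern; how the thesis aggregates them across queries is not stated in this record.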
         
      
      
        
          
            
  
    Text: Lailatul_Qadri_PhD_Thesis.pdf - Other
   
  
  
 
          
            
          
            
           
            
           
        
        
       
    
   
  
  
  More information
  
    
      Published date: June 2011
 
    
  
  
    
  
    
  
    
  
    
  
    
  
    
  
    
     
        Organisations: University of Southampton, Web & Internet Science
      
    
  
    
  
  
  
    
  
  
        Identifiers
        Local EPrints ID: 194917
        URI: http://eprints.soton.ac.uk/id/eprint/194917
        
        
        
        
          PURE UUID: e769d3fa-7295-40bc-828d-7e439c4535cb
        
  
    
        
          
        
    
        
          
            
              
            
          
        
    
        
          
            
          
        
    
  
  Catalogue record
  Date deposited: 12 Aug 2011 14:16
  Last modified: 15 Mar 2024 02:33
  
  
 
 
  
    
    
      Contributors
          Author: Lailatul Qadri binti Zakaria
          Thesis advisor: Paul Lewis
              
              
            
            
          
        
      
      
      
    
  
   
  